Lista 1¶

1. Zapoznanie się z opisem zbioru danych, wybór odpowiedniego zakresu danych, eksploracyjna analiza danych.¶

zapoznanie się z opisem zbioru danych¶

Do analizy oraz przetworzenia danych wybrano zbiór danych pochodzący z 4 roku prognozy, ponieważ ma najwięcej spółek, które osiągneły bankructwo.

  • Dane zawierają wskaźniki finansowe z 4 roku okresu prognozy i odpowiednią etykietę klasy, która wskazuje status bankructwa po 2 latach.
  • Dane zawierają 9792 instancje (sprawozdania finansowe), 515 reprezentuje upadłe firmy, 9277 firm, które nie zbankrutowały w okresie prognozy.
In [ ]:
from scipy.io import arff
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)
file_path= 'pcbd/4year.arff'

data, meta = arff.loadarff(file_path)

print(meta)
Dataset: '1year-weka.filters.unsupervised.instance.SubsetByExpression-Enot
	Attr1's type is numeric
	Attr2's type is numeric
	Attr3's type is numeric
	Attr4's type is numeric
	Attr5's type is numeric
	Attr6's type is numeric
	Attr7's type is numeric
	Attr8's type is numeric
	Attr9's type is numeric
	Attr10's type is numeric
	Attr11's type is numeric
	Attr12's type is numeric
	Attr13's type is numeric
	Attr14's type is numeric
	Attr15's type is numeric
	Attr16's type is numeric
	Attr17's type is numeric
	Attr18's type is numeric
	Attr19's type is numeric
	Attr20's type is numeric
	Attr21's type is numeric
	Attr22's type is numeric
	Attr23's type is numeric
	Attr24's type is numeric
	Attr25's type is numeric
	Attr26's type is numeric
	Attr27's type is numeric
	Attr28's type is numeric
	Attr29's type is numeric
	Attr30's type is numeric
	Attr31's type is numeric
	Attr32's type is numeric
	Attr33's type is numeric
	Attr34's type is numeric
	Attr35's type is numeric
	Attr36's type is numeric
	Attr37's type is numeric
	Attr38's type is numeric
	Attr39's type is numeric
	Attr40's type is numeric
	Attr41's type is numeric
	Attr42's type is numeric
	Attr43's type is numeric
	Attr44's type is numeric
	Attr45's type is numeric
	Attr46's type is numeric
	Attr47's type is numeric
	Attr48's type is numeric
	Attr49's type is numeric
	Attr50's type is numeric
	Attr51's type is numeric
	Attr52's type is numeric
	Attr53's type is numeric
	Attr54's type is numeric
	Attr55's type is numeric
	Attr56's type is numeric
	Attr57's type is numeric
	Attr58's type is numeric
	Attr59's type is numeric
	Attr60's type is numeric
	Attr61's type is numeric
	Attr62's type is numeric
	Attr63's type is numeric
	Attr64's type is numeric
	class's type is nominal, range is ('0', '1')

Atrybut class jest etykietą klasy, która wskazuje status bankructwa po 2 latach. Wartość 1 oznacza, że firma zbankrutowała, a wartość 0 oznacza, że firma nie zbankrutowała. Atrybut class jest typem nominalnym, zatem nie ma sensu znajdowania średniej lub mediany.

Wszystkie inne atrybuty są numeryczne. Według źródła drugiego pewne atrybuty są liczbami całkowitymi a reszta jest rzeczywistymi. A6, A59 są atrybutami całkowitymi, a reszta jest rzeczywista.

  • X6 retained earnings / total assets
  • X59 long-term liabilities / equity
In [ ]:
df = pd.DataFrame(data)

df.describe()
Out[ ]:
Attr1 Attr2 Attr3 Attr4 Attr5 Attr6 Attr7 Attr8 Attr9 Attr10 Attr11 Attr12 Attr13 Attr14 Attr15 Attr16 Attr17 Attr18 Attr19 Attr20 Attr21 Attr22 Attr23 Attr24 Attr25 Attr26 Attr27 Attr28 Attr29 Attr30 Attr31 Attr32 Attr33 Attr34 Attr35 Attr36 Attr37 Attr38 Attr39 Attr40 Attr41 Attr42 Attr43 Attr44 Attr45 Attr46 Attr47 Attr48 Attr49 Attr50 Attr51 Attr52 Attr53 Attr54 Attr55 Attr56 Attr57 Attr58 Attr59 Attr60 Attr61 Attr62 Attr63 Attr64
count 9791.000000 9791.000000 9791.000000 9749.000000 9.771000e+03 9791.000000 9791.000000 9773.000000 9792.000000 9791.000000 9791.000000 9749.000000 9771.000000 9791.000000 9.784000e+03 9773.000000 9773.000000 9791.000000 9771.000000 9771.000000 9634.000000 9791.000000 9771.000000 9581.000000 9791.000000 9773.000000 9.151000e+03 9561.000000 9791.000000 9771.000000 9771.000000 9696.000000 9749.000000 9773.000000 9791.000000 9791.000000 5350.000000 9791.000000 9771.000000 9749.000000 9605.000000 9771.000000 9.771000e+03 9.771000e+03 9179.000000 9749.000000 9719.000000 9791.000000 9771.000000 9773.000000 9791.000000 9716.000000 9561.000000 9561.000000 9.792000e+03 9771.000000 9791.000000 9776.000000 9791.000000 9178.000000 9760.000000 9.771000e+03 9749.000000 9561.000000
mean 0.043019 0.596404 0.130959 8.136600 6.465164e+01 -0.059273 0.059446 19.884016 1.882296 0.389040 0.075417 0.210989 0.398902 0.059460 3.017681e+03 0.617918 20.976033 0.064580 -0.019081 62.704589 1.218724 0.066203 -0.070364 0.247742 0.222839 0.451115 1.115883e+03 6.725180 3.946479 5.353531 0.041258 341.625124 8.445313 4.979157 0.058091 2.077261 70.659877 0.487190 -1.072578 3.064235 0.968902 -0.371479 7.356944e+02 6.729892e+02 5.458024 7.274189 112.989701 -0.002370 -0.517222 7.085001 0.469319 10.031638 6.114681 7.402928 7.686330e+03 -0.992263 0.035022 1.133287 0.856053 118.156064 25.194430 2.015157e+03 8.660813 35.949619
std 0.359321 4.587122 4.559074 290.647281 1.475939e+04 6.812754 0.533344 698.697015 17.674650 4.590299 0.528232 74.237274 37.974787 0.533344 1.022731e+05 78.494223 698.757245 0.736143 25.583613 377.204157 5.930840 0.504481 23.889882 8.268015 4.852418 74.037751 3.143938e+04 147.963574 0.865714 340.974268 25.585724 6145.604519 69.690183 58.480776 0.483463 17.341615 621.311292 4.578432 77.056762 87.916989 41.191681 14.174896 3.283705e+04 3.281128e+04 186.414617 290.619843 1993.125597 0.525467 15.737098 287.770829 4.554869 897.307846 90.190534 146.013868 7.605261e+04 77.007971 8.945365 8.038201 26.393305 3230.316692 1099.260821 1.171461e+05 60.838202 483.318623
min -12.458000 0.000000 -445.910000 -0.045319 -3.794600e+05 -486.820000 -12.458000 -1.848200 -0.032371 -445.910000 -12.244000 -6331.800000 -1460.600000 -12.458000 -1.567500e+06 -6331.800000 0.000857 -12.458000 -1578.700000 0.000000 -1.146300 -12.244000 -1578.700000 -314.370000 -466.340000 -6331.800000 -2.590100e+05 -990.020000 -0.440090 -4940.000000 -1495.600000 0.000000 0.000000 -756.500000 -9.043100 -0.000014 -3.715000 -445.910000 -7522.000000 -8.833300 -1086.800000 -719.800000 -1.158700e+05 -1.158700e+05 -2834.900000 -6.639200 -3.630700 -13.815000 -837.860000 -0.045239 0.000000 0.000000 -1033.700000 -1033.700000 -7.132200e+05 -7522.100000 -597.420000 -30.892000 -284.380000 0.000000 -12.656000 -1.496500e+04 -0.024390 -0.000015
25% 0.001321 0.263145 0.020377 1.047000 -5.121700e+01 -0.000578 0.003004 0.428300 1.006675 0.294440 0.009457 0.007608 0.021204 0.003008 2.173550e+02 0.061874 1.448900 0.003008 0.001937 15.244500 0.920125 0.000000 0.000963 0.004049 0.135600 0.057772 0.000000e+00 0.035097 3.398450 0.082730 0.004434 47.222750 2.702300 0.297400 0.000872 1.030900 1.097050 0.419750 0.000687 0.051234 0.025029 0.000000 6.890100e+01 3.651950e+01 0.010131 0.613890 15.836000 -0.047398 -0.035386 0.768760 0.185285 0.128978 0.684900 0.946990 2.184000e+01 0.003121 0.008768 0.885722 0.000000 5.356325 4.267700 4.323400e+01 2.938800 2.012900
50% 0.041364 0.467740 0.199290 1.591800 -5.557600e-02 0.000000 0.048820 1.088700 1.161300 0.510450 0.062544 0.143690 0.063829 0.048859 9.065550e+02 0.219130 2.134600 0.048859 0.030967 35.657000 1.045700 0.050118 0.025931 0.149150 0.386750 0.199240 1.005500e+00 0.470770 3.976400 0.226950 0.037939 80.884500 4.467900 1.975500 0.046855 1.559400 3.110400 0.613270 0.030427 0.182500 0.089499 0.032384 1.034800e+02 5.786200e+01 0.232560 1.041100 38.482000 0.006990 0.004644 1.230000 0.336680 0.221380 1.211800 1.378300 9.503300e+02 0.043679 0.098026 0.958305 0.002129 9.482000 6.283550 7.472900e+01 4.848900 4.041600
75% 0.111130 0.689255 0.410670 2.880400 5.573200e+01 0.065322 0.126940 2.691000 1.970225 0.714290 0.140805 0.513820 0.127935 0.126960 2.412500e+03 0.599580 3.781500 0.126960 0.083315 65.667500 1.208925 0.126675 0.071692 0.360600 0.614030 0.542190 5.236100e+00 1.570000 4.498550 0.434085 0.093553 133.202500 7.627100 4.509700 0.126375 2.284250 12.239750 0.777415 0.080855 0.666480 0.215980 0.081628 1.481050e+02 8.510950e+01 0.816100 1.971200 70.936000 0.084418 0.051082 2.268800 0.530775 0.364155 2.274200 2.426300 4.694550e+03 0.117170 0.242680 0.996163 0.211790 19.506000 9.938200 1.233450e+02 8.363800 9.413500
max 20.482000 446.910000 22.769000 27146.000000 1.034100e+06 322.200000 38.618000 53209.000000 1704.800000 12.602000 38.618000 3340.900000 2707.700000 38.618000 8.085500e+06 4401.300000 53210.000000 50.266000 1082.600000 26606.000000 396.160000 38.618000 879.860000 400.590000 12.602000 3594.600000 2.037300e+06 11864.000000 9.651800 29526.000000 1083.100000 385590.000000 5534.100000 4260.200000 38.618000 1704.800000 24487.000000 12.602000 112.020000 8007.100000 3443.400000 160.110000 3.020000e+06 3.020000e+06 10337.000000 27146.000000 140990.000000 33.535000 107.680000 27146.000000 446.910000 88433.000000 4784.100000 11678.000000 6.123700e+06 112.020000 226.760000 668.750000 1661.000000 251570.000000 108000.000000 1.077900e+07 5662.400000 21153.000000
In [ ]:
sns.countplot(x='class', data=df, hue='class')

plt.title('Class Distribution')
plt.legend(loc='upper right', title='class', labels=['Non-bankcrupt', 'Bankcrupt'])
plt.show()
No description has been provided for this image

Obserwacja:

Widać, że dane ze względu na klasyfikacje nie są równomiernie podzielone.

In [ ]:
# make a pairplot of missing values for each feature
plt.figure(figsize=(16, 10))
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Values')
plt.xlabel('Features')
plt.ylabel('Data Points')
plt.show()

df.isnull().sum().sort_values(ascending=False)
No description has been provided for this image
Out[ ]:
Attr37    4442
Attr27     641
Attr60     614
Attr45     613
Attr64     231
Attr53     231
Attr28     231
Attr54     231
Attr24     211
Attr41     187
Attr21     158
Attr32      96
Attr52      76
Attr47      73
Attr33      43
Attr46      43
Attr4       43
Attr40      43
Attr63      43
Attr12      43
Attr61      32
Attr30      21
Attr44      21
Attr43      21
Attr42      21
Attr5       21
Attr39      21
Attr49      21
Attr62      21
Attr31      21
Attr20      21
Attr19      21
Attr13      21
Attr23      21
Attr56      21
Attr16      19
Attr17      19
Attr34      19
Attr50      19
Attr26      19
Attr8       19
Attr58      16
Attr15       8
Attr51       1
Attr57       1
Attr59       1
Attr48       1
Attr1        1
Attr38       1
Attr36       1
Attr3        1
Attr6        1
Attr7        1
Attr10       1
Attr11       1
Attr14       1
Attr18       1
Attr22       1
Attr25       1
Attr29       1
Attr2        1
Attr35       1
Attr55       0
Attr9        0
class        0
dtype: int64

Po wypisaniu ile jest wartości brakujących oraz uporządkowaniu ich malejąco możemy zauważuć, że atrybut 37 ma najwięcej wartości brakujących równych 4442. Jest to dokładnie 4442/9792 = 45.4% wartości brakujących, Decyzja o usunięciu atrybutu 37 została podjęta na podstawie tego, że jest to zbyt duża ilość wartości brakujących.

Kolejny atrybut 27 który ma 641 wartości brakujących, co stanowi 641/9792 = 6.5% wartości brakujących. Z racji tego, że jest to niewielka ilość wartości brakujących, przebadamy czy jednak wartości te są istotne dla naszej analizy.

In [ ]:
print(df[df['Attr27'].isnull()]['class'].value_counts())
class
b'0'    487
b'1'    154
Name: count, dtype: int64

Okazuje się, że te przypadki posiadające brak wartości atrybutu 27, posiadają aż 154 etykiety bankructwa, co Stanowi 154/515 = 29% danych gdzie nastąpiło bankructwo W całym zbiorze, dlatego nie będziemy usuwać rekordów ze zbioru.

Kolejno atrybuty z brakami...

In [ ]:
print(df[df['Attr60'].isnull()]['class'].value_counts())
class
b'0'    561
b'1'     53
Name: count, dtype: int64
In [ ]:
print(df[df['Attr45'].isnull()]['class'].value_counts())
class
b'0'    560
b'1'     53
Name: count, dtype: int64
In [ ]:
print(df[df['Attr64'].isnull()]['class'].value_counts())
class
b'0'    203
b'1'     28
Name: count, dtype: int64

Kolejne atrybuty zawieraja również rekordy związane z bankructwe, dlatego nie będziemy usuwać rekordów ze zbioru ani usuwać kolumny, a jedynie uzupełnimy braki w danych.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,35))
sns.boxplot(data=df, orient="h")
plt.title('Boxplot of Features')
plt.xlabel('Values')
plt.ylabel('Features')
plt.show()
No description has been provided for this image

Po tym zestawieniu wykresów typu Box, od razu widać, że mamy do czynienia z wartościami odstającymi w poszczególych atrybutach.

Dodatkowo Boxploty są praktycznie niewidoczne, co oznacza, że mamy do czynienia z wartościami odstającymi w każdym atrybucie.

In [ ]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df['Attr1'], orient="h")
plt.title('Attr1 Boxplot')
plt.xlabel('Values')
plt.ylabel('Attr1')
plt.show()
No description has been provided for this image
In [ ]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df['Attr11'], orient="h")
plt.title('Attr11 Boxplot')
plt.xlabel('Values')
plt.ylabel('Attr11')
plt.show()
No description has been provided for this image

Nawet dla przykładowych atrybutów, które mają niską wartość odchylenia standardowego, jak np. atrybut 1, widać, że mamy do czynienia z wartościami odstającymi i boxploty są już trochę bardziej widoczne ale nadal wąskie.

2. Wyczyszczenie zbioru danych o brakujące wartości, wartości ujemne, różna skala wartości atrybutów (normalizacja, standaryzacja). Patrz literatura [4, 5].¶

Decyzja o usunięciu atrybutu 37 została podjęta na podstawie tego, że jest to zbyt duża ilość wartości brakujących, co stanowi 45.4% wartości brakujących, co jest zbyt dużą ilością, aby uzupełniać braki w danych.

In [ ]:
df = df.drop('Attr27', axis=1)

Braki w kolejnych atrybutach to maksymalnie 641/9792 = 6.5% wartości brakujących, co jest niewielką ilością, dlatego zdecydowano się na uzupełnienie brakujących wartości.

In [ ]:
missing_columns = df.columns[df.isnull().any()].tolist()
In [ ]:
from sklearn.impute import SimpleImputer

attributes = missing_columns

for attr in attributes:
    skewness = df[attr].skew()
    imputer = SimpleImputer(strategy='mean' if abs(skewness) < 0.5 else 'median')
    df[attr] = imputer.fit_transform(df[[attr]])
    if abs(skewness) < 0.5:
        print(f"Atrybut {attr} jest symetryczny. Skośność: {skewness}. Zastąp brakujące wartości średnią.")
    else:
        if skewness > 0:
            print(f"Atrybut {attr} jest prawostronnie skośny. Skośność: {skewness}. Zastąp brakujące wartości medianą.")
        else:
            print(f"Atrybut {attr} jest lewostronnie skośny. Skośność: {skewness}. Zastąp brakujące wartości medianą.")
Atrybut Attr1 jest prawostronnie skośny. Skośność: 13.467518228865412. Zastąp brakujące wartości medianą.
Atrybut Attr2 jest prawostronnie skośny. Skośność: 94.23602939482838. Zastąp brakujące wartości medianą.
Atrybut Attr3 jest lewostronnie skośny. Skośność: -95.70656988633446. Zastąp brakujące wartości medianą.
Atrybut Attr4 jest prawostronnie skośny. Skośność: 86.06915280975713. Zastąp brakujące wartości medianą.
Atrybut Attr5 jest prawostronnie skośny. Skośność: 43.835753902533675. Zastąp brakujące wartości medianą.
Atrybut Attr6 jest lewostronnie skośny. Skośność: -21.697536070930056. Zastąp brakujące wartości medianą.
Atrybut Attr7 jest prawostronnie skośny. Skośność: 42.749747697328274. Zastąp brakujące wartości medianą.
Atrybut Attr8 jest prawostronnie skośny. Skośność: 58.85986014465329. Zastąp brakujące wartości medianą.
Atrybut Attr10 jest lewostronnie skośny. Skośność: -94.02810740094908. Zastąp brakujące wartości medianą.
Atrybut Attr11 jest prawostronnie skośny. Skośność: 44.26663626067478. Zastąp brakujące wartości medianą.
Atrybut Attr12 jest lewostronnie skośny. Skośność: -54.69677552071819. Zastąp brakujące wartości medianą.
Atrybut Attr13 jest prawostronnie skośny. Skośność: 36.073219282723755. Zastąp brakujące wartości medianą.
Atrybut Attr14 jest prawostronnie skośny. Skośność: 42.74961976122516. Zastąp brakujące wartości medianą.
Atrybut Attr15 jest prawostronnie skośny. Skośność: 55.47711843228474. Zastąp brakujące wartości medianą.
Atrybut Attr16 jest lewostronnie skośny. Skośność: -35.628273578296714. Zastąp brakujące wartości medianą.
Atrybut Attr17 jest prawostronnie skośny. Skośność: 58.84945396150125. Zastąp brakujące wartości medianą.
Atrybut Attr18 jest prawostronnie skośny. Skośność: 48.64885332661505. Zastąp brakujące wartości medianą.
Atrybut Attr19 jest lewostronnie skośny. Skośność: -12.356586608052332. Zastąp brakujące wartości medianą.
Atrybut Attr20 jest prawostronnie skośny. Skośność: 48.70550707367376. Zastąp brakujące wartości medianą.
Atrybut Attr21 jest prawostronnie skośny. Skośność: 56.5556255643253. Zastąp brakujące wartości medianą.
Atrybut Attr22 jest prawostronnie skośny. Skośność: 46.26921466931942. Zastąp brakujące wartości medianą.
Atrybut Attr23 jest lewostronnie skośny. Skośność: -24.09197817879497. Zastąp brakujące wartości medianą.
Atrybut Attr24 jest prawostronnie skośny. Skośność: 14.60678650386577. Zastąp brakujące wartości medianą.
Atrybut Attr25 jest lewostronnie skośny. Skośność: -91.08910857612695. Zastąp brakujące wartości medianą.
Atrybut Attr26 jest lewostronnie skośny. Skośność: -52.2761087355474. Zastąp brakujące wartości medianą.
Atrybut Attr28 jest prawostronnie skośny. Skośność: 60.338643118714984. Zastąp brakujące wartości medianą.
Atrybut Attr29 jest symetryczny. Skośność: -0.09216803975722088. Zastąp brakujące wartości średnią.
Atrybut Attr30 jest prawostronnie skośny. Skośność: 70.68294257224264. Zastąp brakujące wartości medianą.
Atrybut Attr31 jest lewostronnie skośny. Skośność: -5.281146969351796. Zastąp brakujące wartości medianą.
Atrybut Attr32 jest prawostronnie skośny. Skośność: 45.54528427755971. Zastąp brakujące wartości medianą.
Atrybut Attr33 jest prawostronnie skośny. Skośność: 66.02578140181893. Zastąp brakujące wartości medianą.
Atrybut Attr34 jest prawostronnie skośny. Skośność: 63.65522769545239. Zastąp brakujące wartości medianą.
Atrybut Attr35 jest prawostronnie skośny. Skośność: 53.0930250381841. Zastąp brakujące wartości medianą.
Atrybut Attr36 jest prawostronnie skośny. Skośność: 96.74860440312183. Zastąp brakujące wartości medianą.
Atrybut Attr37 jest prawostronnie skośny. Skośność: 23.70329591828747. Zastąp brakujące wartości medianą.
Atrybut Attr38 jest lewostronnie skośny. Skośność: -94.8067253633291. Zastąp brakujące wartości medianą.
Atrybut Attr39 jest lewostronnie skośny. Skośność: -95.36281584763287. Zastąp brakujące wartości medianą.
Atrybut Attr40 jest prawostronnie skośny. Skośność: 80.15908074330225. Zastąp brakujące wartości medianą.
Atrybut Attr41 jest prawostronnie skośny. Skośność: 60.62886605451784. Zastąp brakujące wartości medianą.
Atrybut Attr42 jest lewostronnie skośny. Skośność: -40.35845415459756. Zastąp brakujące wartości medianą.
Atrybut Attr43 jest prawostronnie skośny. Skośność: 82.0557315870161. Zastąp brakujące wartości medianą.
Atrybut Attr44 jest prawostronnie skośny. Skośność: 82.21184494133858. Zastąp brakujące wartości medianą.
Atrybut Attr45 jest prawostronnie skośny. Skośność: 44.45412252522891. Zastąp brakujące wartości medianą.
Atrybut Attr46 jest prawostronnie skośny. Skośność: 86.10213600771331. Zastąp brakujące wartości medianą.
Atrybut Attr47 jest prawostronnie skośny. Skośność: 52.18426409036969. Zastąp brakujące wartości medianą.
Atrybut Attr48 jest prawostronnie skośny. Skośność: 25.33432634199644. Zastąp brakujące wartości medianą.
Atrybut Attr49 jest lewostronnie skośny. Skośność: -39.19660551283668. Zastąp brakujące wartości medianą.
Atrybut Attr50 jest prawostronnie skośny. Skośność: 88.22880782942197. Zastąp brakujące wartości medianą.
Atrybut Attr51 jest prawostronnie skośny. Skośność: 96.24612936138438. Zastąp brakujące wartości medianą.
Atrybut Attr52 jest prawostronnie skośny. Skośność: 98.51871649091412. Zastąp brakujące wartości medianą.
Atrybut Attr53 jest prawostronnie skośny. Skośność: 32.90539820293406. Zastąp brakujące wartości medianą.
Atrybut Attr54 jest prawostronnie skośny. Skośność: 60.128249480894056. Zastąp brakujące wartości medianą.
Atrybut Attr56 jest lewostronnie skośny. Skośność: -95.54692007280859. Zastąp brakujące wartości medianą.
Atrybut Attr57 jest lewostronnie skośny. Skośność: -45.61710531854953. Zastąp brakujące wartości medianą.
Atrybut Attr58 jest prawostronnie skośny. Skośność: 66.37566701529. Zastąp brakujące wartości medianą.
Atrybut Attr59 jest prawostronnie skośny. Skośność: 48.447640120864264. Zastąp brakujące wartości medianą.
Atrybut Attr60 jest prawostronnie skośny. Skośność: 65.49077002364425. Zastąp brakujące wartości medianą.
Atrybut Attr61 jest prawostronnie skośny. Skośność: 97.17717655626959. Zastąp brakujące wartości medianą.
Atrybut Attr62 jest prawostronnie skośny. Skośność: 83.68270608250617. Zastąp brakujące wartości medianą.
Atrybut Attr63 jest prawostronnie skośny. Skośność: 83.75994654229625. Zastąp brakujące wartości medianą.
Atrybut Attr64 jest prawostronnie skośny. Skośność: 32.61179168566096. Zastąp brakujące wartości medianą.
Atrybut Attr62 jest prawostronnie skośny. Skośność: 83.68270608250617. Zastąp brakujące wartości medianą.
Atrybut Attr63 jest prawostronnie skośny. Skośność: 83.75994654229625. Zastąp brakujące wartości medianą.
Atrybut Attr64 jest prawostronnie skośny. Skośność: 32.61179168566096. Zastąp brakujące wartości medianą.
In [ ]:
plt.figure(figsize=(16, 10))
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Values after Imputation')
plt.xlabel('Features')
plt.ylabel('Data Points')
plt.show()

df.isnull().sum().sort_values(ascending=False)
No description has been provided for this image
Out[ ]:
Attr1     0
Attr2     0
Attr36    0
Attr37    0
Attr38    0
Attr39    0
Attr40    0
Attr41    0
Attr42    0
Attr43    0
Attr44    0
Attr45    0
Attr46    0
Attr47    0
Attr48    0
Attr49    0
Attr50    0
Attr51    0
Attr52    0
Attr53    0
Attr54    0
Attr55    0
Attr56    0
Attr57    0
Attr58    0
Attr59    0
Attr60    0
Attr61    0
Attr62    0
Attr63    0
Attr64    0
Attr35    0
Attr34    0
Attr33    0
Attr16    0
Attr3     0
Attr4     0
Attr5     0
Attr6     0
Attr7     0
Attr8     0
Attr9     0
Attr10    0
Attr11    0
Attr12    0
Attr13    0
Attr14    0
Attr15    0
Attr17    0
Attr32    0
Attr18    0
Attr19    0
Attr20    0
Attr21    0
Attr22    0
Attr23    0
Attr24    0
Attr25    0
Attr26    0
Attr28    0
Attr29    0
Attr30    0
Attr31    0
class     0
dtype: int64

Sprawdzenie czy dane posiadają brakujące wartości

Uwaga: Wartości ujemne¶

Po przeanalizowaniu pochodzenia wartości ujemny w poszczególnych atrybutach można wywnioskować, nie są to wartości nieprawidłowe, ponieważ są to wskaźniki finansowe, które mogą być ujemne, np. wartość zysku netto może być ujemna, co oznacza stratę netto czy inne wskazniki finansowe.

Jeżeli zbiór danych będzie wykorzystywany do klasyfikacji, trzeba wziąć pod uwagę fakt, że wartości ujemne mogą wpływać na wyniki klasyfikacji, ponieważ niektóre modele klasyfikacyjne nie radzą sobie z wartościami ujemnymi. Normalizacja i standaryzacja danych może pomóc w rozwiązaniu tego problemu, aczkolwiek

Normalizacja vs Standaryzacja¶

Czym się różni normalizacja od standaryzacji?

Normalizacja i standaryzacja to dwa różne sposoby przekształcania wartości atrybutów, aby uzyskać wartości w określonym zakresie.

Normalizacja polega na przekształceniu wartości atrybutów w zakresie od 0 do 1. (wariant MinMaxScaler w scikit-learn)

Standaryzacja polega na przekształceniu wartości atrybutów w taki sposób, aby miały średnią 0 i odchylenie standardowe 1.

In [ ]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df_normalized = df.copy()
df_standardized = df.copy()

df_normalized = df_normalized.drop('class', axis=1)

df_standardized = df_standardized.drop('class', axis=1)

scaler = StandardScaler()
normalizer = MinMaxScaler()

df_normalized[df_normalized.columns] = normalizer.fit_transform(df_normalized)
df_standardized[df_standardized.columns] = scaler.fit_transform(df_standardized)

df_normalized['class'] = df['class']
df_standardized['class'] = df['class']

3. Analiza wartości odstających z wykorzystaniem np. Z-Score lub jednego z algorytmów Outlier Detection. Patrz literatura [6].¶

Do analizy wartości został użyty algorytm Isolation Forest, który jest jednym z algorytmów wykrywania wartości odstających.

Isolation Forest jest algorytmem wykrywania wartości odstających, który wykorzystuje drzewa decyzyjne do wykrywania wartości odstających.

Algorytm ten działa na zasadzie izolacji wartości odstających od reszty danych. Algorytm ten jest szybki i skuteczny w wykrywaniu wartości odstających.

In [ ]:
from sklearn.ensemble import IsolationForest

clf = IsolationForest(random_state=0, contamination=0.1) #W tym przypadku jest to 0.1, co oznacza, że oczekuje się, że 10% obserwacji będzie odstających.
clf.fit(df_normalized)
y_pred = clf.predict(df_normalized)
df_norm_outliers = df_normalized.copy()

df_norm_outliers['outliers'] = y_pred
df_norm_outliers['outliers'].value_counts()

outliers = df_norm_outliers[df_norm_outliers['outliers'] == -1]
inliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]

print(f"Number of outliers: {outliers.shape[0]}")
print(f"Number of inliers: {inliers.shape[0]}")

print(df_norm_outliers[df_norm_outliers['outliers'] == -1]['class'].value_counts())

df_norm_outliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]
df_norm_outliers = df_norm_outliers.drop('outliers', axis=1)
Number of outliers: 980
Number of inliers: 8812
class
b'0'    854
b'1'    126
Name: count, dtype: int64

Tutaj problematyczne może być usuwanie outlierów, ponieważ wewnątrz nich zawarte są firmy, które zbankrutowały, a to jest nasz cel klasyfikacji. Usuwając wartości odstające, możemy usunąć wartości, które są kluczowe dla klasyfikacji.

In [ ]:
clf = IsolationForest(random_state=0, contamination=0.1)
clf.fit(df_standardized)
y_pred = clf.predict(df_standardized)
df_std_outliers = df_standardized.copy()

df_std_outliers['outliers'] = y_pred
df_std_outliers['outliers'].value_counts()

outliers = df_std_outliers[df_std_outliers['outliers'] == -1]
inliers = df_std_outliers[df_std_outliers['outliers'] == 1]

print(f"Number of outliers: {outliers.shape[0]}")
print(f"Number of inliers: {inliers.shape[0]}")

print(df_std_outliers[df_std_outliers['outliers'] == -1]['class'].value_counts())

df_std_outliers = df_std_outliers[df_std_outliers['outliers'] == 1]
df_std_outliers = df_std_outliers.drop('outliers', axis=1)
Number of outliers: 980
Number of inliers: 8812
class
b'0'    854
b'1'    126
Name: count, dtype: int64

Dla mechanizmu detekcji outlierów nie ma znaczenia jakie przeskalowanie było użyte.

Kolejnym rozwiązaniem byłoby usuwanie wartości odstających częstszej klasy.

In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_normalized, orient="h", ax=axes[0])
sns.boxplot(data=df_norm_outliers, orient="h", ax=axes[1])
axes[0].set_title('Normalized with outliers (contamination=0.1)')
axes[1].set_title('Normalized without outliers (contamination=0.1)')
plt.show()
No description has been provided for this image
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_standardized, orient="h", ax=axes[0])
sns.boxplot(data=df_std_outliers, orient="h", ax=axes[1])
axes[0].set_title('Standardized with outliers (contamination=0.1)')
axes[1].set_title('Standardized without outliers (contamination=0.1)')
plt.show()
No description has been provided for this image
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
sns.boxplot(data=df_normalized['Attr29'], orient="h", ax=axes[0])
sns.boxplot(data=df_norm_outliers['Attr29'], orient="h", ax=axes[1])
axes[0].set_title('Attr29 with outliers (contamination=0.1) Boxplot')
axes[1].set_title('Attr29 inliers (contamination=0.1) Boxplot')
axes[0].set_xlabel('Values')
axes[0].set_ylabel('Attr29')
axes[1].set_xlabel('Values')
axes[1].set_ylabel('Attr29')
plt.show()
No description has been provided for this image

Isolation Forest 0.3¶

In [ ]:
from sklearn.ensemble import IsolationForest

clf = IsolationForest(random_state=0, contamination=0.3)
clf.fit(df_normalized)
y_pred = clf.predict(df_normalized)
df_norm_outliers = df_normalized.copy()

df_norm_outliers['outliers'] = y_pred
df_norm_outliers['outliers'].value_counts()

outliers = df_norm_outliers[df_norm_outliers['outliers'] == -1]
df_norm_inliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]

print(f"Number of outliers: {outliers.shape[0]}")
print(f"Number of inliers: {df_norm_inliers.shape[0]}")

print(df_norm_outliers[df_norm_outliers['outliers'] == -1]['class'].value_counts())

df_norm_outliers = df_norm_outliers[df_norm_outliers['outliers'] == 1]
df_norm_outliers = df_norm_outliers.drop('outliers', axis=1)
df_norm_inliers = df_norm_inliers.drop('outliers', axis=1)
Number of outliers: 2938
Number of inliers: 6854
class
b'0'    2647
b'1'     291
Name: count, dtype: int64
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_normalized, orient="h", ax=axes[0])
sns.boxplot(data=df_norm_inliers, orient="h", ax=axes[1])
axes[0].set_title('Normalized with outliers (contamination=0.3)')
axes[1].set_title('Normalized without outliers (contamination=0.3)')
plt.show()
No description has been provided for this image
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(20, 5))
sns.boxplot(data=df_normalized['Attr1'], orient="h", ax=axes[0])
sns.boxplot(data=df_norm_inliers['Attr1'], orient="h", ax=axes[1])
axes[0].set_title('Attr1 with outliers Boxplot')
axes[1].set_title('Attr1 inliers (contamination=0.3) Boxplot')

axes[0].set_xlabel('Values')
axes[0].set_ylabel('Attr1')

axes[1].set_xlabel('Values')
axes[1].set_ylabel('Attr1')

plt.show()
No description has been provided for this image
In [ ]:
clf = IsolationForest(random_state=0, contamination=0.3)
clf.fit(df_standardized)
y_pred = clf.predict(df_standardized)
df_std_outliers = df_standardized.copy()

df_std_outliers['outliers'] = y_pred
df_std_outliers['outliers'].value_counts()

std_outliers = df_std_outliers[df_std_outliers['outliers'] == -1]
std_inliers = df_std_outliers[df_std_outliers['outliers'] == 1]

print(f"Number of outliers: {outliers.shape[0]}")
print(f"Number of inliers: {std_inliers.shape[0]}")

print(std_outliers[std_outliers['outliers'] == -1]['class'].value_counts())

df_std_outliers = df_std_outliers[df_std_outliers['outliers'] == 1]
std_inliers = std_inliers.drop('outliers', axis=1)
Number of outliers: 2938
Number of inliers: 6854
class
b'0'    2646
b'1'     292
Name: count, dtype: int64
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_standardized, orient="h", ax=axes[0])
sns.boxplot(data=std_inliers, orient="h", ax=axes[1])
axes[0].set_title('Standarized with outliers (contamination=0.3)')
axes[1].set_title('Standarized without outliers (contamination=0.3)')
plt.show()
No description has been provided for this image
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

df_norm_inliers['class'].value_counts().plot.pie(autopct='%1.1f%%', startangle=140, shadow=True, labels=['Non-bankcrupt', 'Bankcrupt'], ax=axes[0])
axes[0].set_title('Class Distribution without outliers (contamination=0.3)')

df_normalized['class'].value_counts().plot.pie(autopct='%1.1f%%', startangle=140, shadow=True, labels=['Non-bankcrupt', 'Bankcrupt'], ax=axes[1])
axes[1].set_title('Class Distribution with outliers')

plt.show()
No description has been provided for this image

Ponowna normalizacja po usunięciu outlierów:

Dla pózniejszego PCA wymiary powinny być przeskalowane, ponieważ PCA jest wrażliwa na skalę danych.

In [ ]:
df_norm_inliers_norm = df_norm_inliers.copy()
df_norm_inliers_norm = df_norm_inliers_norm.drop('class', axis=1)
df_norm_inliers_norm[df_norm_inliers_norm.columns] = normalizer.fit_transform(df_norm_inliers_norm)
df_norm_inliers_norm['class'] = df_norm_outliers['class']

fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=df_norm_outliers, orient="h", ax=axes[0])
sns.boxplot(data=df_norm_inliers_norm, orient="h", ax=axes[1])
axes[0].set_title('Normalized without outliers (contamination=0.3)')
axes[1].set_title('Normalized normalized without outliers (contamination=0.3)')
plt.show()
No description has been provided for this image
In [ ]:
# ponowana standaryzacja danych bez odstających wartości
df_std_inliers = std_inliers.copy()
df_std_inliers = df_std_inliers.drop('class', axis=1)
df_std_inliers[df_std_inliers.columns] = scaler.fit_transform(df_std_inliers)
df_std_inliers['class'] = std_inliers['class']

fig, axes = plt.subplots(1, 2, figsize=(20, 35))
sns.boxplot(data=std_inliers, orient="h", ax=axes[0])
sns.boxplot(data=df_std_inliers, orient="h", ax=axes[1])
axes[0].set_title('Standardized without outliers (contamination=0.3)')
axes[1].set_title('Standardized standardized without outliers (contamination=0.3)')
plt.show()
No description has been provided for this image

4. Analiza zbiorów danych przy wykorzystaniu dwóch algorytmów redukcji wymiarów, np. PCA, t-SNE, UMAP. Patrz literatura [7-11].¶

In [ ]:
# podział danych na dane i etykiety

# dane oryginalne
y = df['class']
x = df.drop('class', axis=1)

# dane po normalizacji
X_normalized = df_normalized.drop('class', axis=1)
# dane po standaryzacji
X_standardized = df_standardized.drop('class', axis=1)

# dane po normalizacji i usunięciu outlierów
Y_norm_inliers = df_norm_inliers['class']
X_norm_inliers = df_norm_inliers_norm.drop('class', axis=1)
# dane po standaryzacji i usunięciu outlierów
Y_std_inliers = df_std_inliers['class']
X_std_inliers = df_std_inliers.drop('class', axis=1)
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import numpy as np

pca = PCA()
pca.fit(x)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for original data: {d}")


pca_normalized = PCA()
pca_normalized.fit(X_normalized)
cumsum_normalized = np.cumsum(pca_normalized.explained_variance_ratio_)
d_normalized = np.argmax(cumsum_normalized >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for normalized data: {d_normalized}")


pca_standardized = PCA()
pca_standardized.fit(X_standardized)
cumsum_standardized = np.cumsum(pca_standardized.explained_variance_ratio_)
d_standardized = np.argmax(cumsum_standardized >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for standardized data: {d_standardized}")

pca_norm_inliers = PCA()
pca_norm_inliers.fit(X_norm_inliers)
cumsum_norm_inliers = np.cumsum(pca_norm_inliers.explained_variance_ratio_)
d_norm_inliers = np.argmax(cumsum_norm_inliers >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for normalized data without outliers: {d_norm_inliers}")

pca_std_inliers = PCA()
pca_std_inliers.fit(X_std_inliers)
cumsum_std_inliers = np.cumsum(pca_std_inliers.explained_variance_ratio_)
d_std_inliers = np.argmax(cumsum_std_inliers >= 0.95) + 1
print(f"Number of components needed to explain 95% of variance for standardized data without outliers: {d_std_inliers}")

plt.figure(figsize=(16, 5))

sns.barplot(x=np.arange(1, len(cumsum)+1), y=cumsum, label='Original', color='blue', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_normalized)+1), y=cumsum_normalized, label='Normalized', color='green', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_standardized)+1), y=cumsum_standardized, label='Standardized', color='orange', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_norm_inliers)+1), y=cumsum_norm_inliers, label='Normalized without outliers', color='red', alpha=0.5)
sns.barplot(x=np.arange(1, len(cumsum_std_inliers)+1), y=cumsum_std_inliers, label='Standardized without outliers', color='purple', alpha=0.5)


plt.axhline(y=0.95, color='r', linestyle='--')

plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.title('Explained Variance by Number of Components')
plt.legend(loc='upper right')
plt.show()
Number of components needed to explain 95% of variance for original data: 4
Number of components needed to explain 95% of variance for normalized data: 25
Number of components needed to explain 95% of variance for standardized data: 28
Number of components needed to explain 95% of variance for normalized data without outliers: 16
Number of components needed to explain 95% of variance for standardized data without outliers: 27
No description has been provided for this image

Wniosek:

Dla przeprowadzonej normalizacji wystarczy 24 składowych, które wyjaśniają 95% wariancji.

W przypadku przeprowadzenia standaryzacji, potrzeba aż 28 składowych, aby wyjaśnić 95% skumulowanej wariancji.

I tak zwizualizjmy dane za pomocą PCA w dwóch wymiarach, aby zobaczyć jak wyglądają nasze dane.

In [ ]:
print(x.info())
print(X_normalized.info())
print(X_standardized.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9792 entries, 0 to 9791
Data columns (total 63 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Attr1   9792 non-null   float64
 1   Attr2   9792 non-null   float64
 2   Attr3   9792 non-null   float64
 3   Attr4   9792 non-null   float64
 4   Attr5   9792 non-null   float64
 5   Attr6   9792 non-null   float64
 6   Attr7   9792 non-null   float64
 7   Attr8   9792 non-null   float64
 8   Attr9   9792 non-null   float64
 9   Attr10  9792 non-null   float64
 10  Attr11  9792 non-null   float64
 11  Attr12  9792 non-null   float64
 12  Attr13  9792 non-null   float64
 13  Attr14  9792 non-null   float64
 14  Attr15  9792 non-null   float64
 15  Attr16  9792 non-null   float64
 16  Attr17  9792 non-null   float64
 17  Attr18  9792 non-null   float64
 18  Attr19  9792 non-null   float64
 19  Attr20  9792 non-null   float64
 20  Attr21  9792 non-null   float64
 21  Attr22  9792 non-null   float64
 22  Attr23  9792 non-null   float64
 23  Attr24  9792 non-null   float64
 24  Attr25  9792 non-null   float64
 25  Attr26  9792 non-null   float64
 26  Attr28  9792 non-null   float64
 27  Attr29  9792 non-null   float64
 28  Attr30  9792 non-null   float64
 29  Attr31  9792 non-null   float64
 30  Attr32  9792 non-null   float64
 31  Attr33  9792 non-null   float64
 32  Attr34  9792 non-null   float64
 33  Attr35  9792 non-null   float64
 34  Attr36  9792 non-null   float64
 35  Attr37  9792 non-null   float64
 36  Attr38  9792 non-null   float64
 37  Attr39  9792 non-null   float64
 38  Attr40  9792 non-null   float64
 39  Attr41  9792 non-null   float64
 40  Attr42  9792 non-null   float64
 41  Attr43  9792 non-null   float64
 42  Attr44  9792 non-null   float64
 43  Attr45  9792 non-null   float64
 44  Attr46  9792 non-null   float64
 45  Attr47  9792 non-null   float64
 46  Attr48  9792 non-null   float64
 47  Attr49  9792 non-null   float64
 48  Attr50  9792 non-null   float64
 49  Attr51  9792 non-null   float64
 50  Attr52  9792 non-null   float64
 51  Attr53  9792 non-null   float64
 52  Attr54  9792 non-null   float64
 53  Attr55  9792 non-null   float64
 54  Attr56  9792 non-null   float64
 55  Attr57  9792 non-null   float64
 56  Attr58  9792 non-null   float64
 57  Attr59  9792 non-null   float64
 58  Attr60  9792 non-null   float64
 59  Attr61  9792 non-null   float64
 60  Attr62  9792 non-null   float64
 61  Attr63  9792 non-null   float64
 62  Attr64  9792 non-null   float64
dtypes: float64(63)
memory usage: 4.7 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9792 entries, 0 to 9791
Data columns (total 63 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Attr1   9792 non-null   float64
 1   Attr2   9792 non-null   float64
 2   Attr3   9792 non-null   float64
 3   Attr4   9792 non-null   float64
 4   Attr5   9792 non-null   float64
 5   Attr6   9792 non-null   float64
 6   Attr7   9792 non-null   float64
 7   Attr8   9792 non-null   float64
 8   Attr9   9792 non-null   float64
 9   Attr10  9792 non-null   float64
 10  Attr11  9792 non-null   float64
 11  Attr12  9792 non-null   float64
 12  Attr13  9792 non-null   float64
 13  Attr14  9792 non-null   float64
 14  Attr15  9792 non-null   float64
 15  Attr16  9792 non-null   float64
 16  Attr17  9792 non-null   float64
 17  Attr18  9792 non-null   float64
 18  Attr19  9792 non-null   float64
 19  Attr20  9792 non-null   float64
 20  Attr21  9792 non-null   float64
 21  Attr22  9792 non-null   float64
 22  Attr23  9792 non-null   float64
 23  Attr24  9792 non-null   float64
 24  Attr25  9792 non-null   float64
 25  Attr26  9792 non-null   float64
 26  Attr28  9792 non-null   float64
 27  Attr29  9792 non-null   float64
 28  Attr30  9792 non-null   float64
 29  Attr31  9792 non-null   float64
 30  Attr32  9792 non-null   float64
 31  Attr33  9792 non-null   float64
 32  Attr34  9792 non-null   float64
 33  Attr35  9792 non-null   float64
 34  Attr36  9792 non-null   float64
 35  Attr37  9792 non-null   float64
 36  Attr38  9792 non-null   float64
 37  Attr39  9792 non-null   float64
 38  Attr40  9792 non-null   float64
 39  Attr41  9792 non-null   float64
 40  Attr42  9792 non-null   float64
 41  Attr43  9792 non-null   float64
 42  Attr44  9792 non-null   float64
 43  Attr45  9792 non-null   float64
 44  Attr46  9792 non-null   float64
 45  Attr47  9792 non-null   float64
 46  Attr48  9792 non-null   float64
 47  Attr49  9792 non-null   float64
 48  Attr50  9792 non-null   float64
 49  Attr51  9792 non-null   float64
 50  Attr52  9792 non-null   float64
 51  Attr53  9792 non-null   float64
 52  Attr54  9792 non-null   float64
 53  Attr55  9792 non-null   float64
 54  Attr56  9792 non-null   float64
 55  Attr57  9792 non-null   float64
 56  Attr58  9792 non-null   float64
 57  Attr59  9792 non-null   float64
 58  Attr60  9792 non-null   float64
 59  Attr61  9792 non-null   float64
 60  Attr62  9792 non-null   float64
 61  Attr63  9792 non-null   float64
 62  Attr64  9792 non-null   float64
dtypes: float64(63)
memory usage: 4.7 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9792 entries, 0 to 9791
Data columns (total 63 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Attr1   9792 non-null   float64
 1   Attr2   9792 non-null   float64
 2   Attr3   9792 non-null   float64
 3   Attr4   9792 non-null   float64
 4   Attr5   9792 non-null   float64
 5   Attr6   9792 non-null   float64
 6   Attr7   9792 non-null   float64
 7   Attr8   9792 non-null   float64
 8   Attr9   9792 non-null   float64
 9   Attr10  9792 non-null   float64
 10  Attr11  9792 non-null   float64
 11  Attr12  9792 non-null   float64
 12  Attr13  9792 non-null   float64
 13  Attr14  9792 non-null   float64
 14  Attr15  9792 non-null   float64
 15  Attr16  9792 non-null   float64
 16  Attr17  9792 non-null   float64
 17  Attr18  9792 non-null   float64
 18  Attr19  9792 non-null   float64
 19  Attr20  9792 non-null   float64
 20  Attr21  9792 non-null   float64
 21  Attr22  9792 non-null   float64
 22  Attr23  9792 non-null   float64
 23  Attr24  9792 non-null   float64
 24  Attr25  9792 non-null   float64
 25  Attr26  9792 non-null   float64
 26  Attr28  9792 non-null   float64
 27  Attr29  9792 non-null   float64
 28  Attr30  9792 non-null   float64
 29  Attr31  9792 non-null   float64
 30  Attr32  9792 non-null   float64
 31  Attr33  9792 non-null   float64
 32  Attr34  9792 non-null   float64
 33  Attr35  9792 non-null   float64
 34  Attr36  9792 non-null   float64
 35  Attr37  9792 non-null   float64
 36  Attr38  9792 non-null   float64
 37  Attr39  9792 non-null   float64
 38  Attr40  9792 non-null   float64
 39  Attr41  9792 non-null   float64
 40  Attr42  9792 non-null   float64
 41  Attr43  9792 non-null   float64
 42  Attr44  9792 non-null   float64
 43  Attr45  9792 non-null   float64
 44  Attr46  9792 non-null   float64
 45  Attr47  9792 non-null   float64
 46  Attr48  9792 non-null   float64
 47  Attr49  9792 non-null   float64
 48  Attr50  9792 non-null   float64
 49  Attr51  9792 non-null   float64
 50  Attr52  9792 non-null   float64
 51  Attr53  9792 non-null   float64
 52  Attr54  9792 non-null   float64
 53  Attr55  9792 non-null   float64
 54  Attr56  9792 non-null   float64
 55  Attr57  9792 non-null   float64
 56  Attr58  9792 non-null   float64
 57  Attr59  9792 non-null   float64
 58  Attr60  9792 non-null   float64
 59  Attr61  9792 non-null   float64
 60  Attr62  9792 non-null   float64
 61  Attr63  9792 non-null   float64
 62  Attr64  9792 non-null   float64
dtypes: float64(63)
memory usage: 4.7 MB
None
In [ ]:
pca = PCA(n_components=2)

df_pca = pca.fit_transform(x)

df_pca = pd.DataFrame(df_pca, columns=['PC1', 'PC2'])
df_pca['class'] = y


pca = PCA(n_components=2)
df_normalized_pca = pca.fit_transform(X_normalized)
df_normalized_pca = pd.DataFrame(data=df_normalized_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column
df_normalized_pca['class'] = df['class']


pca = PCA(n_components=2)
df_standardized_pca = pca.fit_transform(X_standardized)
df_standardized_pca = pd.DataFrame(data=df_standardized_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column
df_standardized_pca['class'] = df['class']

pca = PCA(n_components=2)
df_norm_inliers_pca = pca.fit_transform(X_norm_inliers)
df_norm_inliers_pca = pd.DataFrame(data=df_norm_inliers_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column
df_norm_inliers_pca['class'] = Y_norm_inliers.values

pca = PCA(n_components=2)
df_std_inliers_pca = pca.fit_transform(X_std_inliers)
df_std_inliers_pca = pd.DataFrame(data=df_std_inliers_pca, columns=[f'PC{i}' for i in range(1, 3)])
#add class column

df_std_inliers_pca['class'] = Y_std_inliers.values
print(df_std_inliers_pca['class'].value_counts())

fig, axes = plt.subplots(1, 5, figsize=(16, 5))
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue='class', ax=axes[0])
axes[0].set_title('PCA - Original Data')
sns.scatterplot(data=df_normalized_pca, x='PC1', y='PC2', hue='class', ax=axes[1])
axes[1].set_title('Normalized data')
sns.scatterplot(data=df_standardized_pca, x='PC1', y='PC2', hue='class', ax=axes[2])
axes[2].set_title('Standardized data')
sns.scatterplot(data=df_norm_inliers_pca, x='PC1', y='PC2', hue='class', ax=axes[3])
axes[3].set_title('Normalized without outliers')
sns.scatterplot(data=df_std_inliers_pca, x='PC1', y='PC2', hue='class', ax=axes[4])
axes[4].set_title('Standardized without outliers')
fig.tight_layout()
plt.show()
class
b'0'    6631
b'1'     223
Name: count, dtype: int64
No description has been provided for this image
<Figure size 1000x500 with 0 Axes>
In [ ]:
pca = PCA(n_components=24)
df_normalized_pca = pca.fit_transform(X_normalized)
df_normalized_pca = pd.DataFrame(data=df_normalized_pca, columns=[f'PC{i}' for i in range(1, 25)])
#add class column
df_normalized_pca['class'] = df['class']


pca = PCA(n_components=28)
df_standardized_pca = pca.fit_transform(X_standardized)
df_standardized_pca = pd.DataFrame(data=df_standardized_pca, columns=[f'PC{i}' for i in range(1, 29)])
#add class column
df_standardized_pca['class'] = df['class']

pca_norm_inliers = PCA(n_components=16)
df_norm_inliers_pca = pca_norm_inliers.fit_transform(X_norm_inliers)
df_norm_inliers_pca = pd.DataFrame(data=df_norm_inliers_pca, columns=[f'PC{i}' for i in range(1, 17)])
#add class column
df_norm_inliers_pca['class'] = Y_norm_inliers.values

pca_std_inliers = PCA(n_components=27)
df_std_inliers_pca = pca_std_inliers.fit_transform(X_std_inliers)
df_std_inliers_pca = pd.DataFrame(data=df_std_inliers_pca, columns=[f'PC{i}' for i in range(1, 28)])
#add class column
df_std_inliers_pca['class'] = Y_std_inliers.values
In [ ]:
import time
from sklearn.manifold import TSNE

time_start = time.time()
perplexity = 40
n_iter = 500
random_state =  42

fig_height = 10
fig_width = 16

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_standarized = tsne.fit_transform(X_standardized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_standarized = pd.DataFrame(data=tsne_standarized, columns=['t-SNE1', 't-SNE2'])

# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_standarized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE standarized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.003s...
c:\Users\filip\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\externals\loky\backend\context.py:136: UserWarning: Could not find the number of physical cores for the following reason:
found 0 physical cores < 1
Returning the number of logical cores instead. You can silence this warning by setting LOKY_MAX_CPU_COUNT to the number of cores you want to use.
  warnings.warn(
  File "c:\Users\filip\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\externals\loky\backend\context.py", line 282, in _count_physical_cores
    raise ValueError(f"found {cpu_count_physical} physical cores < 1")
[t-SNE] Computed neighbors for 9792 samples in 0.714s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.156631
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.530121
[t-SNE] KL divergence after 500 iterations: 1.733346
t-SNE done! Time elapsed: 37.02339291572571 seconds
No description has been provided for this image
In [ ]:
 
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.005s...
[t-SNE] Computed neighbors for 9792 samples in 0.418s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.002874
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.813110
[t-SNE] KL divergence after 500 iterations: 1.390919
t-SNE done! Time elapsed: 84.31460404396057 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 2
n_iter = 500
fig_height = 10
fig_width = 16

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 7 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.003s...
[t-SNE] Computed neighbors for 9792 samples in 0.349s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.000791
[t-SNE] KL divergence after 250 iterations with early exaggeration: 95.807755
[t-SNE] KL divergence after 500 iterations: 2.487562
t-SNE done! Time elapsed: 56.609742641448975 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 5
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 16 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.004s...
[t-SNE] Computed neighbors for 9792 samples in 0.347s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.001513
[t-SNE] KL divergence after 250 iterations with early exaggeration: 89.871399
[t-SNE] KL divergence after 300 iterations: 3.805093
t-SNE done! Time elapsed: 27.69949507713318 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 10
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.005s...
[t-SNE] Computed neighbors for 9792 samples in 0.747s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.001920
[t-SNE] KL divergence after 250 iterations with early exaggeration: 83.811539
[t-SNE] KL divergence after 500 iterations: 1.912423
t-SNE done! Time elapsed: 61.91113305091858 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 20
n_iter = 500
tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 61 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.007s...
[t-SNE] Computed neighbors for 9792 samples in 0.597s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.002349
[t-SNE] KL divergence after 250 iterations with early exaggeration: 77.870216
[t-SNE] KL divergence after 500 iterations: 1.658014
t-SNE done! Time elapsed: 85.2188286781311 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 30
n_iter = 500

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.006s...
[t-SNE] Computed neighbors for 9792 samples in 0.561s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.002639
[t-SNE] KL divergence after 250 iterations with early exaggeration: 74.390587
[t-SNE] KL divergence after 500 iterations: 1.499874
t-SNE done! Time elapsed: 61.678648710250854 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 40
n_iter = 500

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.004s...
[t-SNE] Computed neighbors for 9792 samples in 0.668s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.002874
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.813011
[t-SNE] KL divergence after 500 iterations: 1.390330
t-SNE done! Time elapsed: 66.27242016792297 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 50
n_iter = 500

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.004s...
[t-SNE] Computed neighbors for 9792 samples in 0.797s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.003077
[t-SNE] KL divergence after 250 iterations with early exaggeration: 69.729156
[t-SNE] KL divergence after 500 iterations: 1.298629
t-SNE done! Time elapsed: 72.84613847732544 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 70
n_iter = 500

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 211 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.006s...
[t-SNE] Computed neighbors for 9792 samples in 0.696s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.003429
[t-SNE] KL divergence after 250 iterations with early exaggeration: 66.480576
[t-SNE] KL divergence after 500 iterations: 1.156768
t-SNE done! Time elapsed: 77.62437748908997 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 40
n_iter = 2000

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_normalized = tsne.fit_transform(X_normalized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_normalized = pd.DataFrame(data=tsne_normalized, columns=['t-SNE1', 't-SNE2'])

plt.figure(figsize=(16, 10))
sns.scatterplot(data=tsne_normalized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE normalized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.004s...
[t-SNE] Computed neighbors for 9792 samples in 0.435s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.002874
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.813110
[t-SNE] KL divergence after 2000 iterations: 1.085735
t-SNE done! Time elapsed: 223.09739184379578 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 40
n_iter = 500
random_state =  42

fig_height = 10
fig_width = 16

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_standarized = tsne.fit_transform(X_standardized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_standarized = pd.DataFrame(data=tsne_standarized, columns=['t-SNE1', 't-SNE2'])

# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_standarized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE standarized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.007s...
[t-SNE] Computed neighbors for 9792 samples in 0.971s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.156631
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.530121
[t-SNE] KL divergence after 500 iterations: 1.733346
t-SNE done! Time elapsed: 135.84579229354858 seconds
No description has been provided for this image
In [ ]:
time_start = time.time()
perplexity = 40
n_iter = 2000
random_state =  42

fig_height = 10
fig_width = 16

tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter, random_state=random_state)
tsne_standarized = tsne.fit_transform(X_standardized)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
tsne_standarized = pd.DataFrame(data=tsne_standarized, columns=['t-SNE1', 't-SNE2'])

# plot
plt.figure(figsize=(fig_width, fig_height))
sns.scatterplot(data=tsne_standarized, x='t-SNE1', y='t-SNE2', hue=df['class'], palette=sns.color_palette("hls", 2), alpha=0.5)
plt.title(f't-SNE standarized data (perplexity={perplexity}, n_iter={n_iter})')
plt.show()
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9792 samples in 0.013s...
[t-SNE] Computed neighbors for 9792 samples in 2.093s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9792
[t-SNE] Computed conditional probabilities for sample 2000 / 9792
[t-SNE] Computed conditional probabilities for sample 3000 / 9792
[t-SNE] Computed conditional probabilities for sample 4000 / 9792
[t-SNE] Computed conditional probabilities for sample 5000 / 9792
[t-SNE] Computed conditional probabilities for sample 6000 / 9792
[t-SNE] Computed conditional probabilities for sample 7000 / 9792
[t-SNE] Computed conditional probabilities for sample 8000 / 9792
[t-SNE] Computed conditional probabilities for sample 9000 / 9792
[t-SNE] Computed conditional probabilities for sample 9792 / 9792
[t-SNE] Mean sigma: 0.156631
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.530121
[t-SNE] KL divergence after 2000 iterations: 1.509810
t-SNE done! Time elapsed: 461.435617685318 seconds
No description has been provided for this image

T-SNE jest algorytmem redukcji wymiarów, który jest używany do wizualizacji danych w dwóch lub trzech wymiarach. Używany do wizualizacji danych w celu zrozumienia struktury danych. T-SNE w porównaniu do PCA próbuje utrzymać lokalne strutkury, klastry, a nie globalne jak PCA. Utrzymuje sąsiedztwo punktów w przestrzeni. Zwraca uawgę na odległości między punktami gwaratnuje bliskie położenie punktów należących do klastrów, aczkolwiek nie gwarantuje zachowania odległości między klastrami. Nie utrzymuje odległości między odległymi punktami. Należy TSNE wykonać wiele razy dla różnych wartości parametrów perplexity oraz liczbie iteracji. Rozkład T-Studenta pomaga w rozwiązaniu problemu "crowding problem". T-SNE rozpycha gęste klastry i ściąga rzadkie klastry, dlatego interpretacja "wielkosći" między klastrami może nie mieć sensu. Tutaj zastosowano próbę ustabilizowania ukłądu przy narastającej wartości perplexity, a potem wybrano odpowiednią liczbę iteracji. Niestety nie udało się uzyskać zadowalającego wyniku.

5. Uruchomienie wybranego modelu klasyfikacji lub grupowania. Analiza porównawcza wyników oraz decyzji podjętych w trakcie przygotowania danych do modelowania¶

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay


random_state = 42

def evaluate_model(data):
    model = DecisionTreeClassifier(random_state=random_state)
    if data['class'].dtype != 'int':
        data['class'] = data['class'].astype('int')
    y = data['class']
    data = data.drop('class', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=random_state)
    print(f"Model: {model.__class__.__name__}")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    cr = classification_report(y_test, y_pred, output_dict=True)
    return cm, cr
In [ ]:
tsne_normalized['class'] = df['class']
In [ ]:
df_orginal_cm, df_orginal_cr =  evaluate_model(df)

df_normalized_cm, df_normalized_cr =  evaluate_model(df_normalized)

df_standardized_cm, df_standardized_cr =  evaluate_model(df_standardized)

df_norm_inliers_cm, df_norm_inliers_cr =  evaluate_model(df_norm_inliers)

df_std_inliers_cm, df_std_inliers_cr =  evaluate_model(df_std_inliers)

df_norm_pca_cm, df_norm_pca_cr =  evaluate_model(df_normalized_pca)

df_std_pca_cm, df_std_pca_cr = evaluate_model(df_standardized_pca)

df_norm_inliers_pca_cm, df_norm_inliers_pca_cr =  evaluate_model(df_norm_inliers_pca)

df_std_inliers_pca_cm, df_std_inliers_pca_cr =  evaluate_model(df_std_inliers_pca)

df_tsne_normalized_cm, df_tsne_normalized_cr =  evaluate_model(tsne_normalized)
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
Model: DecisionTreeClassifier
In [ ]:
#make a subplot of all confusion matrix
fig, axes = plt.subplots(2, 5, figsize=(20, 10))

ConfusionMatrixDisplay(df_orginal_cm).plot( values_format='d', ax=axes[0][0])
axes[0][0].set_title('Original Data')

ConfusionMatrixDisplay(df_normalized_cm).plot( values_format='d', ax=axes[0][1])
axes[0][1].set_title('Normalized Data')

ConfusionMatrixDisplay(df_standardized_cm).plot( values_format='d', ax=axes[0][2])
axes[0][2].set_title('Standardized Data')

ConfusionMatrixDisplay(df_norm_inliers_cm).plot( values_format='d', ax=axes[0][3])
axes[0][3].set_title('Normalized without outliers')

ConfusionMatrixDisplay(df_std_inliers_cm).plot( values_format='d', ax=axes[0][4])
axes[0][4].set_title('Standardized without outliers')

ConfusionMatrixDisplay(df_norm_pca_cm).plot( values_format='d', ax=axes[1][0])
axes[1][0].set_title('Normalized PCA')

ConfusionMatrixDisplay(df_std_pca_cm).plot( values_format='d', ax=axes[1][1])
axes[1][1].set_title('Standardized PCA')

ConfusionMatrixDisplay(df_tsne_normalized_cm).plot( values_format='d', ax=axes[1][2])
axes[1][2].set_title('t-SNE normalized data')

ConfusionMatrixDisplay(df_norm_inliers_pca_cm).plot( values_format='d', ax=axes[1][3])
axes[1][3].set_title('Normalized without outliers PCA')

ConfusionMatrixDisplay(df_std_inliers_pca_cm).plot( values_format='d', ax=axes[1][4])
axes[1][4].set_title('Standardized without outliers PCA')


fig.tight_layout()
plt.show()


#Wniosk
No description has been provided for this image
In [ ]:
fig, axes = plt.subplots(2, 5, figsize=(20, 10))

sns.heatmap(pd.DataFrame(df_orginal_cr).iloc[:-1, :].T, annot=True, ax=axes[0][0], cmap='viridis')
axes[0][0].set_title('Original Data')

sns.heatmap(pd.DataFrame(df_normalized_cr).iloc[:-1, :].T, annot=True, ax=axes[0][1], cmap='viridis')
axes[0][1].set_title('Normalized Data')

sns.heatmap(pd.DataFrame(df_standardized_cr).iloc[:-1, :].T, annot=True, ax=axes[0][2], cmap='viridis')
axes[0][2].set_title('Standardized Data')

sns.heatmap(pd.DataFrame(df_norm_inliers_cr).iloc[:-1, :].T, annot=True, ax=axes[0][3], cmap='viridis')
axes[0][3].set_title('Normalized without outliers')

sns.heatmap(pd.DataFrame(df_std_inliers_cr).iloc[:-1, :].T, annot=True, ax=axes[0][4], cmap='viridis')
axes[0][4].set_title('Standardized without outliers')

sns.heatmap(pd.DataFrame(df_norm_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][0], cmap='viridis')
axes[1][0].set_title('Normalized PCA')

sns.heatmap(pd.DataFrame(df_std_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][1], cmap='viridis')
axes[1][1].set_title('Standardized PCA')

sns.heatmap(pd.DataFrame(df_tsne_normalized_cr).iloc[:-1, :].T, annot=True, ax=axes[1][2], cmap='viridis')
axes[1][2].set_title('t-SNE normalized data')

sns.heatmap(pd.DataFrame(df_norm_inliers_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][3], cmap='viridis')
axes[1][3].set_title('Normalized without outliers PCA')

sns.heatmap(pd.DataFrame(df_std_inliers_pca_cr).iloc[:-1, :].T, annot=True, ax=axes[1][4], cmap='viridis')
axes[1][4].set_title('Standardized without outliers PCA')

fig.tight_layout()
plt.show()
No description has been provided for this image
In [ ]: